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Abstract 



Prompted by the need to simulate large molecular or gravitational systems 
and the availability of multiprocessor computers, alternatives to the standard 
Ewald calculation of Coulombic interactions have been developed. The two 
most popular alternatives, the fast multipole method (FMM) and the particle- 
particle particle-mesh (P^M) method are compared here to the Ewald method 
for a single processor machine. Parallel processor implementations of the P^M 
and Ewald methods are compared. The P'^M method is found to be both faster 
than the FMM and easier to implement efficiently as it relies on commonly 
available software (FFT subroutines). Both the Ewald and P'^M method are 
easily implemented on parallel architectures with the P^M method the clear 
choice for large systems. 



I. INTRODUCTION 

The evaluation of Coulombic interactions for large systems is a common computational 
problem. In biomolecular systems the scale of the structures, for example biological mem- 
branes, often require simulating large systems. Many algorithms have been used for this. 
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Here three of the most common, the Ewald |T|], particle-particle particle-mesh, P^M 0, 
and fast multipole method, FMM, are commented on and compared for single processor 
and (for the first two methods) multiprocessor computers. Our aim is to suggest areas of 
application for each method and provide some guides to their implementation. 

Our observations are the result of applying these methods to condensed matter problems 
rather than a mathematical or algorithmic interest in the methods themselves. This explains 
some of the omissions and qualitative nature of much of the discussion. For example the 
implementation for multiprocessor computers was directed to the study of molecular systems 
where the immediate goal was to evaluate the long range interactions in a time comparable 
to the short range interactions. 

Section two presents the Ewald and P^M [|], [Q methods together since they are very 
similar. Appendices A and B give a compilation of necessary formulas for the P^M method 
and a discussion of parameter selection. 

Section three, together with appendix C, discusses the FMM method and how it can be 
efficiently implemented on single processor machines. Although the operations count for this 
method scales linearly with the number of charges, different codes have shown a variation of 
two orders of magnitude in the number of charges required for this method to exceed Ewald 
in speed. 

Section four discusses single processor implementations of the three methods starting 
with a discussion of accuracy for P^M methods. Timings for the three methods are then 
compared. Finally the relative advantages of these methods in treating non-cubic peri- 
odic cells, alternate boundary conditions, non-Coulombic interactions, and two dimensional 
systems are considered. 

Section five presents parallel implementations of the Ewald and P^M methods (specifi- 
cally on the CRAY T3D). 
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II. THE P^M AND THE EWALD METHOD 



The P^M method is closely related to the Ewald method so we consider the two together 
here to give the usual heuristic derivation of both methods. 

The total electrostatic potential energy for a system of point charges 
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is rewritten by adding and subtracting a term corresponding physically to the electrostatic 
energy of a system of smooth spherical charges, with a density Pi{r), centered on the particle 
positions to obtain P 



U 



1 N N 
1 N N 

+^EE 

^ i=i j=i 



ZiZj 



drdr' 



Pi{r)pj{r') 



1 ^ 

-y 

^ i=i 



Piir)piir') 



(2) 



The first, bracketed, term now corresponds to particles interacting through a short- 
ranged interaction which is zero beyond the overlap of pi pj. The second term corresponds 
to the Coulomb energy of a smooth charge distribution p{r) = J2f Pii^') and the last term 
is a constant self-energy. 

The Ewald formula uses a Gaussian for Pi{r) 



p,{r) = Z,{Gy7rf/'exp[-G\r-r 



(3) 



The P'^M method allows any choice for pi but we have found no advantage in the usual 
alternative choices and will use the Gaussian form throughout. The interaction of Gaussian 
shaped charges can be evaluated analytically and gives rise to an error function. Doing this 
the potential energy becomes 
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The treatment of the remaining term now distinguishes P^M from Ewald. The Ewald 
formula results from an exact, analytic evaluation of the term 

p(r)p(r') Ajf 

^^^^^ drdr' = nY: ^Pim-k) (5) 



which together with 
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where the charge structure factor S{k) = Y^^i Zie^^''^\ gives 

f2 is the periodic system volume. 

A prescribed accuracy requires a cutoff Tc oc l/G, so 0{Npr^^) ~ 0{N'^/G^Q) operations 
for the first term in the above equation, and kmax oc G for the last term so 0{G^Q) k vectors 
and 0{NG^Q) operations, since computing each S{k) requires 0{N) operations. Varying 
G^il to minimize the total number of operations gives the optimal G^il oc -\/iV and the 
familiar N^^"^ scaling of computation time with the number of charges. 

The P^M method follows from treating eqn. ^ by numerical methods. Essentially the 
density p(r) is assigned to a grid and then p{k) computed by an FFT. Computing the 
electrostatic forces on each particle 

F, = -V,f/ = -J p,{r)vHr)dr (8) 

also requires transforming the field ik$(/c) back to real space. 

Again evaluating the first ("real space") term in eqn. ^ to a given precision requires 
Tc oc 1/G or 0{N'^ /G^Q) operations. For an accurate numerical evaluation of eqn. ^ (the 
"reciprocal space" term) a grid that accurately resolves pi is necessary. From eqn. |^ the 
width of Pi ~ l/G so a grid spacing A such that 1/GA ^ n, where n is some number (say 8 
or greater), is needed. This gives G^Qn^ grid points. The FFT to solve the Poisson equation 



uses 0{G^Qn^\n{G^Qn^)) steps and forming the density takes 0{N)n^ steps. Varying G^fl 
to minimize the time gives the optimal G^Q oc and time scahng of 0{Nln{N)). If other 
considerations govern the choice of G then P^M still scales as 0{Nln{N)) while Ewald 
deteriorates to 0{N'^). Note also that the optimal G now is constant as increases at fixed 
density. By contrast the optimal G for the Ewald method decreases at constant density 
as N~^/^ implying a longer ranged "real space" interaction and thus memory for neighbor 
tables increasing as N^^'^ . 

This straight forward implementation of P'^M, referred to below as primitive P^M, is 
practical and clearly far preferable to Ewald evaluations for large systems. It is found (as 
already suggested by the choice n 8 above) however that the density of a single particle, 
Pj(r), must often be spread over several hundred grid points. Clearly distributing each 
charge to fewer grid points would yield a still faster algorithm. 

Hockney and Eastwood suggested using a different, narrower, density (or assignment 
function) for this last term and compensating by modifying the Coulomb Green's function. 
The basic idea may be seen in rewriting eqn. ^ 



n^%\m?-a^f,^^\wik)? (9) 



where the "assignment function" W{r) = J2iLi Wi{r) would be narrower than pi making it 
easier to form the "density" and to interpolate the forces. The Coulomb Green's function is 
thus modified to 

" \W{k)\^ 

Since this exact compensation requires p{k) and W{k) which vary for each configuration 
nothing has been gained. Instead Hockney and Eastwood use a modified Coulomb Green's 
function which minimizes the mean squared error in the forces (due to the new assignment 
function and also finite size grid errors) for charges uniformly distributed in the cell. The 
resulting formula as well as those for the assignment functions are given in Appendix A. 



Details may be found in Hockney and Eastwood or more concisely in 0] and we give results 
below demonstrating the final errors associated with various assignment schemes. 
The steps in the P^M method can now be summarized as: 

• Compute the short ranged terms. 

• Form an effective density W{r) = Y^iLi ^^4(^)5 where specific forms for various Wi{r) 
which assign the density to = 3, 4, 5, . . . grid points in each dimension are given in 
appendix A. 

• Using the modified Coulomb Green's function, also given in appendix A, solve Poisson's 
equation to get the potential and electric fields due to the effective density. 

• Finally, interpolate the fields back to the particles ( eqn. ||) using the Wi{r). 

A discussion on the selection of the assignment order, grid size, and parameter G is given 
in appendix B. 



III. DESCRIPTION OF FMM 

The Fast Multipole Method primarily due to Greengard and Rokhlin P] has been de- 
scribed many times. A concise description is given in a paper discussing its first three 
dimensional implementation P] and an intuitive overview of the reasoning behind the steps 
in ref. 0. Here we review this description and stress the efforts needed to make it efficient. 
The importance of the method is its 0{N) scaling with the number of charges. 

The FMM method for calculating Coulomb interactions is based on two related expan- 
sions: the multipole expansion 

Imax ][f. Yi ir) 
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where the multipole moment for charges is 
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which converges for r > max{r.}, and the local expansion 
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which converges for r < min {r.}. The purpose of the FMM is to calculate the local expansion 
coefficients due to charges at some distance from the point of interest and to account for the 
closer charges by a direct summation. 

This is done in five steps. The simulation cell is successively subdivided. The largest 
division is the cell itself. This is divided into say eight cubes (for example). Each of these 
cubes is then subdivided and so on for a prescribed number of subdivisions, L, so that at 
the finest subdivision there are 8"^ cells. The five steps are now: 

1. Compute multipole moments (using eqn. 0) for all cells at final level of subdivision 
(smallest cells); 

2. Sweep up from smallest cells to largest cell to get multipole moments for cells at all 
subdivision levels using the formula to shift the origin of a multipole expansion given 



3. Sweep down from largest (single) cell to smallest cells to get local expansion coefficients 
in the smallest boxes according to the repeated sequence; 

a Transform local expansion of larger cell to cells at next level of subdivision using 
the formula for shifting the origin of a local expansion given below; 

b Add to these local expansion coefficients the contribution from cells at next level of 
subdivision which have not been included yet and which are not near neighbors of 
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below; 



the cell being considered. This uses the multipole moments of these cells according 
to a formula given below. The first time this is done non-nearest neighbor images 
of the simulation cell will be included for the case of periodic boundary conditions. 

4. Once the preceding step has reached the finest subdivision level evaluate the potential 
and fields for each particle using the local expansion coefficients for the (smallest) cell 
containing the particle. 

5. Add the contributions from other charges in the same cell and in near neighbor cells 
(their contribution is not in the local expansion coefficients) by direct summation. 

The algebra for these steps consists of convolution type sums. With the notation 

(-i)'+"^v^ITI 

aim = (15) 

^J4:^T{l + m)\{l-m)\ 

and replacing the multipole moments and local expansion coefficients by 

(16) 

aim 



the needed formulae for the steps are [ITO 



1. Calculate smallest cell multipole moments from the definition; 

2. for shifting the origin of a multipole expansion 

I' I 
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where R points from the old to new cell origin; 
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3. for shifting the origin of a local expansion use 



C^vm' =E E - ^ - ^')^im (20) 

1=1' m=-l 



t^^(/,m) = YUK)R' (21) 

2/ + 1 

and for adding multipoles to a local expansion use 

4„^' = (-1)'+™' E E t^'^il + ^' - m)Mim (22) 
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m) = Y:^{K)/R'+' (23) 

where again R points from the origin of the multipole calculation to the origin 
for the local expansion. 



The spherical harmonics are those of Jackson's Classical Electrodynamics [|ri[| . A discussion 
of the efficient implementation of this algorithm is given in appendix C. 

IV. NUMERICAL RESULTS AND COMPARISON OF THE METHODS ON A 

SINGLE PROCESSOR 

A. Comments on accuracy of the P^M method (discretization error) 

Although an extended discussion of P^M accuracy is found in ref. we present here 
some indicative results to aid in deciding which variant is necessary for a desired accuracy. 
For the Ewald formula, eqn. 0, convergence of the k-space part clearly depends on the value 
of GA, where A is the grid spacing corresponding to the largest k vector. 

This is also true for primitive P^M, where l/GA corresponds to the number of grid 
points within a Gaussian particle and thus controls the discretization error. For P^M where 



the Gaussian density is replaced in the particle-mesh calculation by an assignment function 
with a modified Coulomb Green's function the situation is less clear intuitively . Figure 1 
shows this dependence on GA. 

Figure 1 shows the relative accuracy of the k-space part of the forces for the various 
assignment schemes and for primitive P^M.(The results graphed are for 512 randomly placed 
charges in a periodic cube and a 32^ grid was used, but the semi-quantitative features of the 
figure are insensitive to these details.) The reference forces used in determining the relative 
error, for this figure were from a well converged Ewald calculation. 

Several observations can be made from the results of figure 1: 

• Primitive P^M (labeled "S3") is here the most accurate but its drawback is the large 
number of grid points (shown above the x-axis) required to describe the Gaussian if 
G is small. (The number of grid points displayed corresponds to a cutoff of 10~^ on 
the Gaussian density). By contrast, for the various assignment schemes (Labeled by 
n) only n^ grid points per charge are needed. (Since the charge density assignment 
factors as a product of x,y, and z values most of the operations involved in assigning 
the charge to the grid scale as n rather than n^.) 

• Each increase in n gives almost an order of magnitude increase in relative accuracy. 
Most condensed matter simulations aim for a relative accuracy of at least 10""^. Al- 
though present day biomolecular force fields are very approximate this sort of accuracy 
would be desirable for potential or free energy comparisons between configurations or 
phases. 

• The results shown are for a random configuration of charges. Since the Hockney 
and Eastwood formula for the modified Coulomb Green's function is based on such a 
configuration it might be suspected that these configurations would give the highest 
accuracy. This is true but even for a highly ordered configuration (perturbed lattice) 
the reduction in accuracy was less than a factor of two. 
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• As mentioned above the figure is insensitive to the number of particles in the periodic 
cell and the grid size. For example, the same plot for a 64^ grid looks very similar. 

B. Timing Comparisons 

The Ewald, P^M, and FMM algorithms have been used to calculate the forces and 
potential energy for a random periodic configuration of N=512, 1000, 5000, 10000, and 
20000 charges. Total timings and other details are given in Table I. The distribution of the 
total time over the various operations is shown for the P^M method in Table II and for 
FMM in Table III. The computations were done on an IBM RS/6000 590 workstation using 
the ESSL mathematical subroutine package for the FFT. 

Such comparisons depend strongly, of course, on the effort put into optimizing the coding. 
As already discussed this is particularly true of the FMM algorithm. These comparisons are 
therefore only semi-quantitative. With this caveat in mind several trends are worth noting: 

• The P^M is roughly four times faster than the FMM algorithm for all N shown in the 
table. (An extrapolation based on the scalings for the two methods suggest that FMM 
only becomes faster at some unphysical size > 10^°.) 

• Even for the N=512 system the P^M is faster than the Ewald method. (We estimate 
this crossover occurs for A^ < 50). The FMM code used here is faster than the Ewald 
for 800 particles. This indicates the optimization effort put into this code since 
crossovers in excess of 50,000 particles have been reported for these two methods. 

C. Other Contrasts between P^M and FMM 

Although P'^M is faster than FMM the choice between the two may be dictated by other 
considerations. These include: 

• Ease of Coding 

The P^M method is considerably easier to code than the FMM. The Poisson solver for 
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periodic systems is based on available FFT software. The assignment of the particle 
charge to the grid and the interpolation of the electric field from the grid to the particles 
are both straight forward loops over the particles. 

• Non-cubic periodic cells 

Treating non-cubic, parallelepiped periodic cells is straight forward with the P^M 
method using the corresponding non-cubic grid. This situation is more complicated 
for the FMM since the convergence of the expansions depends on the ratio of distance 
from the origin to the distance of the nearest charge included in the expansion. For 
cells separated by one or more intervening cells this ratio is now anisotropic and, at 
the least, additional bookkeeping is required. 

• Alternate Boundary Conditions 

The simulation of a cluster of charges (vacuum rather than periodic boundary condi- 
tions) or a slab periodic in only two dimensions is natural with the FMM and only 
involves omitting a step (discussed under step 3b in section 3). 

The P^M algorithm can also be modified to treat these cases by cutting off the Coulomb 
potential (either spherically for a cluster or in the transverse direction for the slab) 
at a distance large enough to correctly include all interactions in the cluster or slab 
but short enough to eliminate interactions with any periodic images as suggested in 
i and H. 

Specifically the development of section two for a cluster or slab proceeds to eqn. ^ 
exactly as before and again the remaining integral is to be evaluated numerically by 
Fourier methods. As already stated this is done by first cutting off the Coulomb 
potential at |r — r'| sufficiently large that p{r)p{r') is zero and taking a periodic cell 
large enough to insure that the periodic images implied by a Fourier treatment do not 
interact. Note however that the density p{r) consists of Gaussians on each particle 
and thus extends somewhat beyond the cluster or slab boundaries. 
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For a spherical cluster the potential can be cutoff at a distance i?c, which must ex- 
ceed the cluster diameter by several Gaussian widths, l/\/2G, in order to include all 
interactions within the cluster. The ATr/k"^ in eqn. is then multiplied by the form 



factor [1 — cos{kRc)] but otherwise unchanged. Taking the periodic cell length as 2i?c 
or greater, to avoid interactions between periodic images the computation proceeds as 
before with similar accuracy. 

For the slab the Coulomb potential can be cutoff when z exceeds Z^. which must exceed 
the slab width, as before, by several Gaussian widths. The An/k"^ in eqn. |A5| is now 
multiplied by the form factor 

1 - exp{-k±Zc) (cos{k;,Zc) - ^ sin(A;^Zc) j (24) 



where k± = ^k^ + ky. The last term, {kz/k±) sin{kzZc), is singular at fc^ = but this 
term can be shown not to contribute to the potential as kj_ and can be omitted 
for those k values. (Of course, the k = value contributes nothing for charge neutral 
systems). With this procedure and again taking the periodic cell length along z large 
enough that the slab images are separated by at least Zc the method gives accuracy 
comparable to the fully periodic case. 

In sum, treating a cluster or slab with the P^M algorithm involves a one time mod- 
ification to the optimal Coulomb's Green's function and the use of a cell somewhat 
more than twice as large as the system dimension in either three or one dimensions. 
This only affects the FFT times and for similar accuracy the P^M is still considerably 
faster than FMM. 

Non-Coulomhic Interactions 

The P'^M method proceeds similarly for any isotropic Fourier transformable poten- 
tial. (Short-range repulsion can be treated separately so that the remainder is trans- 
formable). For a dipolar potential the vector Gaussian dipolar density is assigned to 
the grid and the electric field again computed by Fourier methods. 
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Dimensionality 

Two dimensional systems may be easily treated by both algorithms. 



V. PARALLEL IMPLEMENTATION OF EWALD AND P^M 

In this section parallel implementation of the k-space part of eqns. |^ and ^ is discussed 
for the Ewald and the P^M method. The real-space part is a reasonably short-ranged pair 
interaction and parallel algorithms for these interactions have been extensively discussed 
in the literature [Q. The results we use for illustration were computed on the T3D using 
shared memory constructs but the remarks apply almost unchanged to any distributed 
memory machine or message passing system. For discussion of a parallel FMM see M 



A. Ewald 



This method is ideal for parallel implementation as very little interprocessor commu- 



nication is required and several implementations have been presented |jT^. The required 
structure factors are rewritten as 

S{k) = Y: Z,e^'-^ = E E e''-^' = E Sp{k) (25) 
j P jeP P 

where P denotes processors. Particles are distributed to processors and the partial structure 
factor Sp{k) computed on each processor for all k vectors. These partial structure factors 
are then summed across processors to obtain the total S{k). The sum over k vectors (eqn. ^ 
to obtain the total energy U can be done on one or all processors. Computing the forces 
(where the summand [^(/c)^ is instead i'ke^'^'^^ S [k)) involves no further interprocessor com- 
munication. It is preferable to divide the particles among processors and have each processor 
handle all k vectors since then the usual complex multiplication can most easily be used to 
build up the necessary e*(''i+''2)-»-j = ^iki-rj ^ik2-rj _ 

Figure 2 shows the time required to evaluate the energy and forces versus the number of 
processors for N=10240 particles and G^l^^^ = 8.3. Several observations can be made: 
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• As already stated, scaling of the time with number of processors is almost ideal (oc 
1/NPE). The interprocessor communication time was always significantly less than 
.1% of the total. 

• If Gn^/'^ is scaled as iV^/^ to minimize the time (discussion in section 2) then the timings 
scale as N^^^. Results for N=1024, GQ^^^ = 5.6, and N=102400, GQ^/'^ = 12.2 are not 
shown since the scaled timings overlap the results shown. 

For systems of the order of 10^ charges and a hundred processors the Ewald method is 
a good choice. For larger systems the N^/'^ Ewald scaling dictates that the P^M algorithm 
should be used. 



B. P^M 

Our replicated data, parallel version of the P^M algorithm follows the steps listed at the 
end of section 2. 

• A discretized effective density 

N 

W{r,) = E W,ir,) = Y.i: W,ir,) (26) 
j=i p jeP 

is formed at the grid points, denoted by r^, by first summing the contribution of 
particles on each processor and then summing over processors to get the total effective 
density which we store on all processors. 

• This density is then partitioned for use with a distributed data FFT. Since the effective 
density is stored on each processor no interprocessor communication is required here. 
The total energy is calculated by first summing the distributed Fourier components of 
the effective density and potential on each processor and then summing over processors. 
To compute the electrostatic force on the particles (eqn. the electric field on the 
grid is reassembled (on the T3D a shared memory construct, shmem_f collect, does this 
operation) and stored on each processor. 
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• Finally eqn. with W replacing p, is evaluated for the particles on each processor 
to get the forces. To save memory the same array was used to successively store the 
electric field components although this requires duplicating the single particle density 
calculations increasing the overall time by roughly 20 %. 

This replicated data implementation has several advantages: The coding is straight- 
forward and all message passing is hidden in global sums or other supplied routines (e.g. 
shmem_fcollect); The assignment of particles to processors is arbitrary although a compara- 
ble number per processor is desirable for load balance. It has the disadvantage of requiring 
more memory than a purely distributed data implementation and involves interprocessor 
communication (e.g. in forming the total density and the electric fields) that could be 
minimized if the particles were initially sorted on processors in a domain decomposition cor- 
responding to the grid decomposition used by the FFT. As seen in the illustrative timings 
shown below this extra effort would be worthwhile for smaller systems on a large number of 
processors where the fraction of time used in interprocessor communications is significant. 
The replicated data version is thus limited to less than a few hundred processors. The 
method is clearly well adapted to a distributed memory, domain decomposition approach to 
remove this limit. In this parallel version advantage has also not been taken in the FFTs of 
the fact that the charge density and the electric fields are real. 

Figure 3 illustrates timing trends for the n = 4 assignment scheme for systems of 102,400 
and 1,024,000 particles. These timings are only indicative with the same 64^ grid used for 
both systems. The total time (upper solid line) is broken into five, cumulative, components: 

• The time required to tabulate the effective density, W{rg), (lower solid line indicated 
by Density arrow) due to the particles on each processor. This step involves no com- 
munication and scales linearly with the number of particles per processor. 

• The time to get the total W{rg) by summing across processors (second solid line) 

• The time spent in the FFT calculations for the discretized potential and electric field 
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terms (solid line indicated by Poisson arrow) 

• The time spent summing Fourier components to get the contribution to the total 
electrostatic energy. This time is not significant and is indiscernible in the figure 
appearing as a broadening of the previous solid line. 

• Finally the time spent calculating the forces (interpolating the electric field from the 
grid to the particle positions) is indicated on the right by the Ei arrow. This time is 
roughly three times that required to form W{rg) since similar steps are required but 
now for three components. Again no interprocessor communication is involved and 
this step scales ideally. 

The dashed line shows the interprocessor communication time from component two (sum- 
ming W{rg) across processors) and collecting the electric field components included in com- 
ponent five. (Communication time from the FFTs is included in Poisson). As the time 
for the rest of the calculation decreases linearly with the number of processors this slowly 
increasing component, (it almost doubles in going from 8 to 256 processors) becomes im- 
portant rising to 75% of the total time for N=102,400 and 256 processors. For N=l,024,000 
the communications times are about the same but less important. 

For 102,400 charges and 32 processors the P^M method is approximately 50 times faster 
than Ewald and is several hundred times faster for the 1,024,000 charge case. 

VI. SUMMARY 

In this paper it has been argued that the Ewald method is suitable for systems of a 
few hundred particles per processor. For larger systems the P^M algorithm is increasingly 
more efficient. The FMM is a second choice for all system sizes both in terms of speed and 
program complexity. Considerations such as the relative advantages gained in not updating 
the field due to distant particles every molecular dynamics step or the use of accelerated 
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series convergence methods ||T6[ with the FMM have not been addressed but seem unhkely 
to alter these conclusions. 
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APPENDIX A: P^M FORMULAE 

Some useful formulas for the P^M method are collected here starting with the assignment 
functions, Wn{x), used to create a density and to interpolate the forces. The subscript 
on Wn{x) refers to the number of grid points (along each coordinate axis) "supporting" 
a charged particle and x is the distance from the particle to the grid point measured in 
grid spacings. The full assignment function is the product along each direction, W{r) = 
W {x)W {y)W (z) . The Wn are zero outside the indicated range and are given by successive 
convolutions, Wn+i = Wn * Wi. These assignment functions satisfy 

/ Wn{x)dx = 1 and E°l-oo Wn{x + j) = 1 (Al) 
The first five Wn{x) are: 
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Wi{x) = 1 |x| < 1/2 



Woix) = 1-1x1 \x\ < 1 



WJx) 



3/4 -x^ \x\<l/2 
i(3/2- \x\)^ 1/2 < |x| < 3/2 

2/3 - + |x|V2 |x| < 1 
(2-|x|)V6 l<|x|<2 



(A2) 



W,(x) = I 



(115 - 120x2 + 48x^)/192 |x| < 1/2 

(55 + 20|x| - 120x2 + 80|x|3 - 16x^)/96 1/2 < |x| < 3/2 
(-5 + 2|x|)V384 3/2 <|x|<5/2 

Higher order functions, if needed, can be easily obtained as 



n-l 



Wn{x) = J2A4l,j){x-j/2y for -l<x-^-<l 

1=0 Z A A 



(A3) 



where j goes from — (n — 1) to (n + 1) in steps of 2. The An{l,j) satisfy the recursion 
relations: 



A„(/,j + l)-A^(/,j-l) 
/ + 1 



^„+i(0,j) = ^(2) 



(A4) 



/ + 1 



starting from Ai{0, 0) = 1.0 . 

To minimize the error associated with using the assignment function W{r) in lieu of p(r) 
as well as aliasing errors Hockney and Eastwood derive for the modified Coulomb Green's 
function iHl 



19 



47r 



k • (k + b) 

— ^w^„^(k + b)p^(k + b) 



G{k) = _ ^ ^ + 
A;2 



(A5) 



E^n(k + b) 



where the usual Fourier wave vectors, k = 27rj/L with j — 1 . . . M ior M grid points on 
a periodic cell length L and the Brillouin zone vectors b = 27rl/A where / = —oo, oo and 
A — L/M. This expression needs to be evaluated only once. 

Some ingredients in this expression are the Fourier transform for the assignment function 



sin(A;A/2) ^ 



V 



A;A/2 



and 



p{k) = e 



The sum in the denominator is evaluated using the identity 

oo -j^ 

cot(x) = ^ 

introduced by Hockney and Eastwood, so 



(A6) 



(A7) 



(A8) 



Sn{k) ^Y^w^ik + b) = [Mk A/2)r E ^ 

b j=-^ [kA/2 + njf'' 



[sin(A;A/2)] 



2n 



cot(x] 



(2n- 1)! 

Denoting z — sin (/c A/2) some algebra gives 

(1 - 2^73) 
{l-z^ + 2^715) 
(1 - 4^73 + 2^^/5 + 4^7315) 



Sn{k) = 



x=kA/2_ 



n = 2 
n — 3 
n = 4 



;i - 5^73 + 7zy9 - 17;s7l89 + 2^72835) n = 5 
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(A9) 



(AlO) 



Higher order Sn{k), if needed, are given by 

n-l 

^n(fc) = E^n(0^'7r(2n) (All) 
1=0 

where the satisfy: 

bnil) = 4[6„-i(/)(n -l-l){n-l- 1/2) - 6„_i(/)(n - If] ^ ^ 

(A12) 

6„(0) = 46„_i(0)(n-l)(n-l/2) 
starting from &i(0) = 1. 



The full S'„(k) = Sn{kx) Sn{ky) Sn{kz) ■ The numerator of eqn. ^5| converges rapidly and 
is evaluated numerically. 



APPENDIX B: TIMING AND SELECTION OF PARAMETERS FOR P^M 

The time required to compute the total energy and the electric field at each particle in 
an N particle system using P'^M with a real space cutoff of Tc, order n assignment scheme, 
and a grid of A^^^ points may be expressed as 

T = ai ( yP^c) N + {a2N + a^Nn^) + {a^Ng In A^^ + a^Ng) . (Bl) 

The first term gives the time for the particle-particle interactions, the second term the time 
to form the density and to interpolate the fields from the grid to the particle positions, and 
the last term the time to do the FFTs. 

Assuming that a desired precision is specified, the allowed real space (particle-particle) er- 
ror, eR^Gvc), (which to a good approximation can be fitted by ~ exp(— G^r^/2)/v^Grv ), 
gives = c/j(e/j) and the allowed k-space error, = F[GA,n], (see figure 1), implies 
GA = Ck{n, efc) or G^Q/Ng = c|. Using these relations to express Vc and Ng in terms of G, 
the time can now be varied to find the optimal G and n. 

Variation with G, OT/dG = 0, imphes 

+ a,^ (B2) 
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4:71 C% 



, G^n ^ 

In^^ + 1 



so 



G 



1/6 



(B3) 



05 + a4[lni^ + 1] 

which, for a given n, can be quickly iterated to find G. The optimal n can be obtained by 
direct evaluation of T at the optimal G for n = 3, 4, 5 . . .. 

If the logarithmic variation of the FFT time is ignored, eqn. implies Tr Tppx at 
the optimal G. This gives a rule of thumb for adjusting G 



G~ (rpold Irpold n1/6 

new ~ ^oldyJ- R / F^'' 



'FFT) 



(B4) 



with corresponding adjustments in Tc and A^^ to maintain accuracy. Equations BT and B3 
also imply that T ~ iV[a + 61n A^]!/^ |T^. 

For many systems there are also moderate range forces (e.g. Van der Waals) to be 
calculated and the may be specified by these forces. The allowable real space error then 
determines G and the k-space error determines the necessary grid size Ng. The optimal n 
is determined empirically unless the {ai ... 05} are known. 



APPENDIX C: EFFECTIVE IMPLEMENTATION OF FMM 

Efficient implementation of the FMM typically requires more effort than for the more 
straight forward Ewald or P^M methods. We discuss in this appendix some of the technical 
details. 

The expansion order p = Imax + 1 is determined by the required accuracy e and the 
number of subdivision levels L is then varied to minimize the computing time. Truncation 
of the various expansions gives an error estimate of the form e ~ ca^, where the coefficient 
c and the geometry related ratio a have been determined empirically, to give the following 
relation for the required expansion order necessary for a desired accuracy 

-ln(100e)/ln(2) . (CI) 

The total time is the sum of the times for the five steps listed in the section III. 
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In step 1 the time to evaluate all multipoles at the finest subdivision 

Ti =aiArp(p + l)/2 (C2) 

where ai is the time to evaluate one term in Eq. [T^. (Note the Yim s can be calculated 
recursively.) 

In Step 2, Eq. 18 is use to shift the multipoles of the child (finer subdivision) to the 
box center of the parent (coarser subdivision). This is done for each box at all levels 
except for the highest level box. 

T, = ^p\p + l)\N,o.es-l) (C3) 

where Nboxes is the total number of boxes in the hierarchical tree (all levels) {Nboxes = 
(^8^+1 — l)/7 ), and 02 equals the time for one complex multiply and add. 

The time in Step 3a using Eq. ^ to transfer the local expansion of the parent to the 
center of the child 

Tsa = ^pHP+^)\Nboxes-l) (C4) 

The time in step 3b to convert the multipole expansion of the members of the inter- 
action list (at most 189 members) of to local expansion about the center of that box 
using Eq. |2^ 

T3b = jpHp + l)l89{Nboxes - 1) (C5) 

The time required in step 4 to evaluate the local expansion (Eq. [T3| for each particle 
in the system 

T, = ^Np{p+l) (C6) 
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Finally in step 5, the near field interaction is evaluated by summing up over all pairs 
in neighbors boxes. 



27 N 

= as-N- (C7) 



where as = time to evaluate one pair interaction. 
The total time is thus, 

T = a,Np{p + 1) + + l)(p + 1 + 189p)](iV,„,,, - 1) + «3yiV^ (CJ 

Minimizing with L gives an expression for the optimal value 



21{a^/a^){l/i 



(C9) 



^ 1 + 189p) 



This gives an optimal time of 



Topi = N 



aipip + 1) + J(216/7)a2a3p2(p + l)(190p + 1) 



(CIO) 



(where we have approximated Nf,oxes — 1 ~ 8^^^/ 7). 

Both terms in Topt scale as p"^ however the coefficient for the second term is much larger. 
In fact at L = Lopt almost all the time is divided equally between converting multiples to 
local expansions (step 3b) and the direct coulomb sums (step 5). 

Greengard observed that the transformation equations |18|, ^ |2^ are all of the form of a 
convolution and FFT's could be used to speed up their evaluation. To use Fourier transforms 
to evaluated these equations t'^^ , t^^, and t^^^ must be mapped to periodic function in 
such a way as not to change the the rhs of these equations. This can be done by padding 
these vectors with zeros. For example define Mp{l, m) for 1 in [0,2p-l] and m in [-2p,2p-l] 
as, 

M(Lm) for \m\ <= I J<p 
Mpil,m) = { (Cll) 

otherwise 
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Using FFT's equations |T^, |20|, |22| in Fourier space will take 

T2 = Tsa = a2{2p)\Nho,es ' 1) 



Tsb = a2i2p)h89{Nbo,es - 1) 

(Note we used the symmetry M{l,—m) = (— l)™M*(/,m) ) which in Fourier space 
implies Mp{l + 2p, —m) = Mp{p,m) ), This reduces the number of terms to evaluate by a 
factor of two) 

Steps 2, 3a and 3b can all be formed in Fourier space, so in principle all that is needed 
is to transform to multipole coefficient M at the finest level to Fourier space and the local 
expansion coefficient L also at the finest level back to real space. The time to do this, using 
symmetry is, 

Tfft = a22[8pHn{2p)]8'' (C12) 

Before writing down the timing for the FMM algorithm using FFT's a numerical issue 
needs to be considered. The vectors A4, C, t^'^ ,t^^, and t^'^^ are factorially varying func- 
tions of their indices. This large dynamic range results in lost of precision when performing 



the FFT and hence the Fourier formulation of equations IS, EG, 53 becomes unstable for 



large p. One way to reduce the dynamic range is by scaling. For example consider equation 



22| . If we introduce a scaling factor s we can rewrite equation ^ as, 

4^, = J2 s(-'')t^^'(/ + /', m' - m)s'-^''s-'Mim (C13) 

l,m 

by introducing Ls{l,m) = s''C'i^, Ms{l,m) = s'^Aiim and Ts[l,m) = sh^^'\l,m) 

Ls{l\m) =Y,Ts{l + l\m -m)Ms{l,m) (C14) 

l,m 

Now s can be chosen to minimize the dynamic range. The above transformation can also 
be performed in Fourier space, and will be more stable than the original FFT formulation 
of equation 22. This however, is not a complete cure but will stabilize the algorithm up 
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to p=16. The same procedure can be applied to equations |T^ and however a different 
scahng s must be chosen. This will require Fourier transformation to be performed on each 
of the scaled functions. These additional FFTs offset the benefit of performing equations 



and ^ in Fourier space for small p. The cross over is at p=16 which is the limit of stability 
of the FFT approach. In light of this these two equations are performed in real space. It 



is however worth evaluating ^ in Fourier space since the each FFT can be amortized over 
189 transformations. The cross over for chosing direct or Fourier space for step 3b is about 
p=2. For this formulation of the FMM algorithm Lopt is given by 



27(03/02) (7/^ 



(C15) 



^ p'^[{p + iy + s- 189] 



with a Topt{N,p) 



Topt = N 



aip{p + 1) + V(216/7)a2a3p2[(p + 1)2 + 1512] 



(C16) 



Comparing the two optimal timing expressions the FFT approach should be favorable 
for p > 2. 
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TABLES 



N 


Ew {G,Kiax} 


P^M {n,G,Nfe} 


FMM {\eYe\s,\max } 


time (sec) 


^rel.err. 


el. err. 


512 




{5,2.12,24} 




.06 


3x 10-5 


4x 10-5 


512 






{1,7} 


.31 


2x 10-5 


4x 10-5 


512 


{7.74, 8} 






.25 


3x 10-5 


Ix 10-5 


1000 




{5,2.11,30} 




.10 


4x 10-5 


8x 10-5 


1000 






{2,7} 


.54 


5x 10-5 


2x 10-5 


1000 


{10.75, 8} 






.68 


4x 10-5 


4x 10-5 


5000 




{5,2.09,48} 




.48 


Ix 10-5 


3x 10-^ 


5000 






{2,7} 


3.45 


8x 10-5 


3x 10-4 


5000 


(10.32. 11} 






8.02 


8x lO"'"' 


3x 10-5 


10000 




{5,2.09,64} 




1.12 


7x 10-5 


4x 10-4 


10000 






{3,7} 


5.21 


9x 10-5 


7x 10-4 


10000 


{10.32. 11} 






26.0 


Gx lO"'"^ 


9x 10^5 


20000 




{5,2.08,80} 




2.40 


6x 10-5 


2x 10-5 


20000 






{3,7} 


10.01 


7x 10-5 


Ix 10-4 


20000 


{12.90. 11} 






78.0 


5x 10-''^ 


3x 10-5 



TABLE I. Timings for Ewald, P^M, and FMM on periodic systems of various size. Parameters 



used in these runs are given in brackets {. . .}. In P^M {n,G,Nfc} n refers to the assignment function 
used and is the grid size. In FMM {levels,l^aj;} the periodic cell is divided into 8'^^^''' cells. 
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N 


{n,G,Nfe} 


R space 


Make p 


Poisson 


Ej 


Total 


512 


{5,2.12,24} 


.02 


.00 


.02 


.02 


.06 


1000 


{5,2.11,30} 


.04 


.01 


.03 


.02 


.10 


5000 


{5,2.09,48} 


.17 


.06 


.14 


.11 


.48 


10000 


{5,2.09,64} 


.36 


.12 


.38 


.26 


1.12 


20000 


{5,2.08,80} 


.73 


.25 


.95 


.47 


2.40 



TABLE II. Breakdown of the total time for the P^M computations of table I into time spent 
summing the short-range interactions (excluding time spent to construct near neighbor tables), 
time spent assigning the charge to the grid. Solving the Poisson eqn. for the potential and field 
on the grid, and the time spent computing the total potential and interpolating the field to the 
particle locations. 



N 


{levels, \„iax} 


tables 


upward pass 


downward pass 


local 


near field 


Total 


512 


{1,7} 


0.00 


0.02 


0.04 


0.01 


0.24 


0.31 


1000 


{2,7} 


0.02 


0.07 


0.30 


0.02 


0.13 


0.54 


5000 


{2,7} 


0.03 


0.14 


0.31 


0.12 


2.82 


3.45 


10000 


{3,7} 


0.40 


0.57 


2.43 


0.24 


1.53 


5.21 


10000 


{3,7} 


0.42 


0.77 


2.45 


0.48 


5.82 


10.01 



TABLE III. Breakdown of FMM times into the steps described in section 3. The initial 
category, tables, refers to time spent tabulating parentage for the cell hierarchy. 
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Figure Captions 

1. The relative error in the forces, ^Eili(ff - ^^«'=*)2/ ^^^^(f*<'*»')2 for 512 particles 
randomly located in the periodic cell for various assignment schemes plotted vs 1./ "dis- 
cretization" (where "discretization", 1/GA, is proportional to the number of grid points 
under each Gaussian charge). The numbers above the x-axis indicate the number of grid 
points each charge is assigned to in the primitive P^M method (curve labeled S3) 

2. Timings for the k-space part of the Ewald sum performed on NPE=1 up to 128 CRAY 
T3D processors for 10240 particles. 

3. Timings for the PM step of the P^M algorithm versus 1/ number of CRAY T3D pro- 
cessors for 102,400 and 1,024,000 particles. The time spent in assigning the density to the 
grid, solving the Poisson equation, and interpolating the fields to the particles are indicated 
by labels at right. Gaps not labeled involve interprocessor communication times for density 
and field transfers whose sum is shown by the dashed line. A 64^ grid and n=4 assignment 
scheme was used for both cases. 
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